Uniformize model processors (models w/o special arg names) #32845

Open

leloykun wants to merge 25 commits into main from fc--uniformize-kwargs-the-rest
Conversation

leloykun (Contributor) commented Aug 16, 2024:

What does this PR do?

  • Uniformizes kwargs for the processors of AltCLIP, Flava, Git, InstructBlipVideo, LLaVa-NeXT-Video, MGP, Siglip, TVP, VideoLLaVa, VILT, and X-CLIP, as discussed in Uniform kwargs for processors #31911; see the sketch below for the shared pattern
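
For context, the shared pattern from #31911 looks roughly like the sketch below, using Siglip as the example. The default values and the method body are illustrative, not the exact code in this PR.

from transformers.processing_utils import ProcessingKwargs, ProcessorMixin


class SiglipProcessorKwargs(ProcessingKwargs, total=False):
    # Per-modality defaults merged into every call; the value here is illustrative
    _defaults = {
        "text_kwargs": {"padding": "max_length"},
    }


class SiglipProcessor(ProcessorMixin):
    # attributes and __init__ omitted for brevity

    def __call__(self, images=None, text=None, audio=None, videos=None, **kwargs):
        # Merge call-site kwargs with the class defaults and the tokenizer's
        # init kwargs, then dispatch each group to the matching sub-processor
        output_kwargs = self._merge_kwargs(
            SiglipProcessorKwargs,
            tokenizer_init_kwargs=self.tokenizer.init_kwargs,
            **kwargs,
        )
        text_inputs = self.tokenizer(text, **output_kwargs["text_kwargs"])
        image_inputs = self.image_processor(images, **output_kwargs["images_kwargs"])
        ...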

TODO:

  • add tests
    • add tests for AltCLIP
    • add tests for Flava
    • add tests for Git
    • add tests for InstructBlipVideo
    • add tests for Llava-NeXT-Video
    • add tests for MGP
    • add tests for Siglip
    • add tests for TVP
    • add tests for VideoLLava
    • add tests for VILT
    • add tests for X-CLIP

Fixes # (issue)

Who can review?

@zucchini-nlp @molbap @NielsRogge

leloykun mentioned this pull request Aug 16, 2024
Comment on lines 123 to 137
if output_kwargs["text_kwargs"].get("visual_prompt") is not None and audio is not None:
    raise ValueError(
        "You cannot provide `visual_prompt` as a positional argument and as a keyword argument at the same time. "
        "Please provide it only as a keyword argument (i.e. `visual_prompt=...`)."
    )
if "visual_prompt" not in output_kwargs["text_kwargs"]:
    warnings.warn(
        "No `visual_prompt` kwarg was detected. The use of `visual_prompt` as an argument without specifying it explicitly as `visual_prompt=` will be deprecated in future versions."
    )
# For backwards compatibility, we reuse `audio` as `visual_prompt` in case
# downstream users passed it as a positional argument
if audio is not None:
    output_kwargs["text_kwargs"]["visual_prompt"] = audio

visual_prompt = output_kwargs["text_kwargs"].pop("visual_prompt", None)
molbap (Contributor):

Same remark as in #32841: we should not be using kwargs for a purpose other than the one advertised!

leloykun (author):

@molbap, I just moved the models with special arg names to #32841 so they'd be in one place.

This PR should now be ready for review.

leloykun marked this pull request as draft on August 16, 2024 09:44
leloykun force-pushed the fc--uniformize-kwargs-the-rest branch from 8a2a2c8 to 0e8e99e on August 16, 2024 14:16
leloykun changed the title from "Uniformize the kwargs for processors of ClipSeg, InstructBlipVideo, LLaVa-NeXT-Video, Owl, VideoLLaVa" to "Uniformize the kwargs for processors of ClipSeg, InstructBlipVideo, LLaVa-NeXT-Video, Owl, Siglip, VideoLLaVa" on Aug 16, 2024
leloykun marked this pull request as ready for review on August 16, 2024 16:17
…lipvideo, llava_next, llava_next_video, siglip, video_llava, vilt
leloykun force-pushed the fc--uniformize-kwargs-the-rest branch from 1380596 to ce9cc73 on August 17, 2024 06:19
leloykun changed the title from "Uniformize the kwargs for processors of ClipSeg, InstructBlipVideo, LLaVa-NeXT-Video, Owl, Siglip, VideoLLaVa" to "Uniformize the kwargs for processors of AltClip, InstructBlipVideo, Flava, LLaVa-NeXT-Video, Siglip, VideoLLaVa, VILT" on Aug 17, 2024
leloykun changed the title from "Uniformize the kwargs for processors of AltClip, InstructBlipVideo, Flava, LLaVa-NeXT-Video, Siglip, VideoLLaVa, VILT" to "Uniformize model processors (models w/o special arg names)" on Aug 17, 2024
leloykun requested a review from molbap on August 17, 2024 06:43
zucchini-nlp (Member) left a comment:

Thanks for this awesome work! I left some general comments, mostly nits about cleaning up docstrings and clarification questions for my own understanding.

Additionally, we have to return BatchFeature and not BatchEncoding in the processor's __call__, and Chameleon needs to be removed from this PR, as it now has its own PR.
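
For reference, a minimal sketch of the requested return type; the helper name and input dicts are placeholders:

from transformers import BatchFeature


def _build_outputs(text_inputs: dict, image_inputs: dict, return_tensors=None) -> BatchFeature:
    # BatchFeature, not BatchEncoding, is the expected return type of a processor's __call__
    return BatchFeature(data={**text_inputs, **image_inputs}, tensor_type=return_tensors)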

Comment on lines 96 to 99
    images: Optional[VideoInput] = None,
    text: Optional[Union[TextInput, PreTokenizedInput, List[TextInput], List[PreTokenizedInput]]] = None,
    audio=None,
    videos=None,
zucchini-nlp (Member):

Hmm, this is quite interesting, since we accept videos as an argument but still annotate it as images: VideoInput.

OK, let's leave it for BC. I need to spend some time making the new video models more standard; currently some work with images and some with videos. But that won't be soon, I guess.

leloykun (author):

@zucchini-nlp I've made these changes (sketched below):

  • we now allow the user to use videos instead of images
  • using images now results in a future warning
  • using both videos and images now results in an error
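
A minimal sketch of the changes listed above, assuming an InstructBlipVideo-style __call__; the exact messages are illustrative:

import warnings


def __call__(self, images=None, text=None, audio=None, videos=None, **kwargs):
    if images is not None and videos is not None:
        # using both videos and images is an error
        raise ValueError("Pass the clips only via `videos=...`, not via both `images` and `videos`.")
    if images is not None:
        # using images still works, but warns about the upcoming removal
        warnings.warn(
            "Passing videos through `images` is deprecated and will be removed in a "
            "future version; pass them as `videos=...` instead.",
            FutureWarning,
        )
        videos = images
    # ... proceed with `videos` from here on ...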

Comment on lines +17 to +21
if isinstance(component_class_name, tuple):
    if "_fast" in component_class_name[0]:
        component_class_name = component_class_name[0]
    else:
        component_class_name = component_class_name[1]
zucchini-nlp (Member):

It's not very clear why we had to override this. Is it because we need to work only with fast tokenizers?

If that's the case: there should usually be no difference in how text is encoded with fast vs. slow tokenizers, so we should rather dig into why there's a difference.

leloykun (author):

I'll make the Chameleon-related changes in #32181.

After that gets merged, I'll rebase this onto main.

tests/test_processing_common.py (outdated; resolved)
tests/test_processing_common.py (resolved)
leloykun (author):

@zucchini-nlp I just added 4 more models:

  • Git
  • MGP
  • TVP
  • X-CLIP

zucchini-nlp (Member) left a comment:

Perfect, thanks for adding them! I left some general questions, but I think we should first focus on merging a smaller PR, to decide how to handle the edge cases we've been discussing.

src/transformers/models/mgp_str/processing_mgp_str.py (outdated; resolved)
src/transformers/models/mgp_str/processing_mgp_str.py (outdated; resolved)
src/transformers/models/tvp/image_processing_tvp.py (outdated; resolved)
-            return_token_type_ids=False,
-            **kwargs,
-        )
+        textual_input = self.tokenizer(text, **output_kwargs["text_kwargs"])
zucchini-nlp (Member):

This actually doesn't match the removed lines. Previously we had some kwargs hardcoded, like truncation or padding, and now users can set those to any value.

TBH I don't know why these were hardcoded, but I'd leave them as is so as not to break anything. In that case we don't need the pad_to_max_length kwarg, as the text is going to be padded to max length no matter what.

leloykun (author):

I moved them to the default values in TvpProcessorKwargs (sketched below). I think that should suffice, since this part would have errored out anyway if downstream users also passed them as kwargs (duplicate args error).
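
Roughly, that move looks like the sketch below; the values mirror the previously hardcoded tokenizer call and should be treated as illustrative:

from transformers.processing_utils import ProcessingKwargs


class TvpProcessorKwargs(ProcessingKwargs, total=False):
    _defaults = {
        "text_kwargs": {
            # previously hardcoded in the tokenizer call
            "padding": "max_length",
            "truncation": True,
            "return_token_type_ids": False,
        },
    }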

Comment on lines 127 to 133
      return_tensors = output_kwargs["common_kwargs"].get("return_tensors")
      if text is not None and videos is not None:
-         encoding["pixel_values"] = image_features.pixel_values
-         return encoding
+         return BatchFeature(data=dict(**encoding, **image_features), tensor_type=return_tensors)
      elif text is not None:
-         return encoding
+         return BatchFeature(data=dict(**encoding), tensor_type=return_tensors)
      else:
-         return BatchEncoding(data=dict(**image_features), tensor_type=return_tensors)
+         return BatchFeature(data=dict(**image_features), tensor_type=return_tensors)
zucchini-nlp (Member):

We can simplify this and other processors by assigning empty defaults up front, filling them in only when the corresponding input is present, and returning a single BatchFeature:

encoding, image_features = {}, {}
# process inputs if present here

return BatchFeature(data={**encoding, **image_features}, tensor_type=return_tensors)

leloykun (author):

@zucchini-nlp I've implemented it slightly differently, but it's now a lot cleaner than before

@yonigozlan @molbap you might want to take a look too.

Comment on lines +31 to +32
"videos" not in inspect.signature(self.processor_class.__call__).parameters
or inspect.signature(self.processor_class.__call__).parameters["videos"].annotation == inspect._empty
zucchini-nlp (Member):

Is this because of InstructBlipVideo? If it's only that, we'd better override this test for InstructBlipVideo and keep the check as `"video_processor" in classes`.

When overriding for TVP, there's no need to check, as we know whether the model has a video processor or not. Actually, it seems like it's going to be skipped, because TVP only has an image_processor.

leloykun (author):

This is for:

  • InstructBlipVideo
  • TVP
  • VideoLlava, and
  • X-CLIP

Only Llava-Next-Video actually has a video_processor_class

leloykun (author):

I'm leaning toward leaving this here, because it covers more models than the original check and it significantly reduces the line count.
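
For reference, a hedged sketch of how the check can gate a test in the mixin; the helper name and skip message are illustrative:

import inspect


def _maybe_skip_video_tests(self):
    # Skip video tests for processors whose __call__ has no annotated `videos`
    # parameter, mirroring the signature check above
    sig = inspect.signature(self.processor_class.__call__)
    if "videos" not in sig.parameters or sig.parameters["videos"].annotation == inspect._empty:
        self.skipTest(f"{self.processor_class.__name__} does not take typed video inputs")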

tests/test_processing_common.py (outdated; resolved)
leloykun force-pushed the fc--uniformize-kwargs-the-rest branch from a44a76a to 9e00f68 on August 20, 2024 10:21
class TvpProcessorTest(ProcessorTesterMixin, unittest.TestCase):
    from_pretrained_id = "Jiqing/tiny-random-tvp"
    processor_class = TvpProcessor
    videos_data_arg_name = "pixel_values"
leloykun (author):

@zucchini-nlp just a heads-up, because I remember you mentioning that you plan to rework the text-video models:

The output keys for video data are inconsistent; some processors use pixel_values_videos while others just use pixel_values. You'd most likely need to standardize them too.

zucchini-nlp (Member):

Yes, it's me who started using pixel_values_videos, hehe. Thanks for letting me know :)
